Background - As artificial intelligence (AI) is increasingly integrated into hematology, concerns about fairness and equitable performance have come to the forefront. While these models offer powerful predictive capabilities, they can inadvertently perpetuate existing disparities if not properly validated across diverse populations. In a disease like acute myeloid leukemia (AML), where treatment decisions are time-sensitive, it is crucial to ensure that prediction tools work equally well for all patients regardless of race or gender. In this study, we used a national cancer registry to assess the performance and fairness of a machine learning (ML) model developed to predict 1-year overall survival in patients with newly diagnosed AML.

Methods - We conducted a retrospective cohort study using the SEER database from 2010 to 2020. We included adult patients (age ≥18) with a new diagnosis of AML and excluded those with missing demographic or survival data. Key variables extracted included age, race/ethnicity, sex, year of diagnosis, and receipt of chemotherapy. The dataset was randomly split into a 70% training set and a 30% validation set; model performance was assessed on the validation set, overall and stratified by race (White, Black, Hispanic, Asian) and sex. Evaluation metrics included the following (a minimal computation sketch follows the list):

- Discrimination: AUC-ROC

- Calibration: Brier score and calibration plots

- Fairness: difference in true positive rates across subgroups (equal opportunity)
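The subgroup evaluation could be implemented along these lines. This is a minimal sketch, not the study's actual code: it assumes a held-out validation DataFrame named val with illustrative column names surv_1yr (observed 1-year survival, 0/1), pred_prob (model-predicted survival probability), and race, plus a hypothetical 0.5 classification threshold.

```python
# Minimal sketch of the subgroup evaluation described above.
# Assumptions (not from the study): DataFrame `val` with columns
# "surv_1yr" (0/1 outcome), "pred_prob" (predicted probability), "race",
# and a 0.5 decision threshold for the TPR calculation.
import pandas as pd
from sklearn.metrics import roc_auc_score, brier_score_loss

def subgroup_metrics(val: pd.DataFrame, group_col: str = "race") -> pd.DataFrame:
    rows = []
    for group, g in val.groupby(group_col):
        y_true = g["surv_1yr"]
        y_prob = g["pred_prob"]
        y_pred = (y_prob >= 0.5).astype(int)  # hypothetical threshold
        # True positive rate = TP / (all observed survivors) in this subgroup.
        tpr = ((y_pred == 1) & (y_true == 1)).sum() / max((y_true == 1).sum(), 1)
        rows.append({
            "group": group,
            "n": len(g),
            "auc": roc_auc_score(y_true, y_prob),       # discrimination
            "brier": brier_score_loss(y_true, y_prob),  # calibration
            "tpr": tpr,                                 # equal-opportunity component
        })
    return pd.DataFrame(rows)

# Equal-opportunity gap: TPR difference between a reference group and each other group.
metrics = subgroup_metrics(val)
ref_tpr = metrics.loc[metrics["group"] == "White", "tpr"].iloc[0]
metrics["tpr_gap_vs_white"] = ref_tpr - metrics["tpr"]
```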

Results - The final cohort included 6,327 AML patients (median age 67; 53% male; 74% White, 10% Black, 9% Hispanic, 5% Asian). Overall, the model performed well (AUC = 0.81; Brier score = 0.14). However, subgroup analysis revealed performance discrepancies:

1. AUC in Black patients was significantly lower (0.71) than in White (0.83), Hispanic (0.79), and Asian (0.82) patients.

2. The model underestimated survival in Black patients and women, as shown by calibration curves and survival probability distributions (see the plotting sketch after this list).

3. The true positive rate (TPR) for 1-year survival prediction was 11.4% lower in Black patients than in White patients, indicating a meaningful fairness gap.
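The stratified calibration check in point 2 could be reproduced with a plot along the following lines; this is again a hedged sketch using the same assumed val DataFrame and illustrative column names as in the earlier snippet, not the study's actual analysis code.

```python
# Hedged sketch of a race-stratified calibration plot.
# Assumes the hypothetical `val` DataFrame ("surv_1yr", "pred_prob", "race") from above.
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

fig, ax = plt.subplots()
for group, g in val.groupby("race"):
    # Observed vs. predicted 1-year survival in quantile bins of predicted probability.
    frac_pos, mean_pred = calibration_curve(
        g["surv_1yr"], g["pred_prob"], n_bins=10, strategy="quantile"
    )
    ax.plot(mean_pred, frac_pos, marker="o", label=group)

ax.plot([0, 1], [0, 1], linestyle="--", color="gray")  # perfect-calibration reference
ax.set_xlabel("Predicted 1-year survival probability")
ax.set_ylabel("Observed 1-year survival fraction")
ax.legend(title="Race")
plt.show()
```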

Conclusion - This study demonstrates that while ML models can accurately predict AML survival at the population level, their performance can vary meaningfully across race and gender. In our SEER-based validation, Black patients and women were more likely to have their survival underestimated, despite the model's strong overall metrics. These findings underscore the importance of evaluating AI tools not just for accuracy but also for equity. Future research should focus on integrating equity-centered frameworks into AI model development so that prediction tools serve all patients effectively and equitably.
